Load packages
library(tidyr)
library(dplyr)
library(tibble)
library(pillar)
library(stringr)
library(brms)
options(brms.backend = "cmdstanr", mc.cores = 2)
library(posterior)
options(pillar.negative = FALSE)
library(loo)
library(priorsense)
library(ggplot2)
library(bayesplot)
theme_set(bayesplot::theme_default(base_family = "sans"))
library(tidybayes)
library(ggdist)
library(patchwork)
library(RColorBrewer)
SEED <- 48927 # set random seed for reproducibility
This notebook contains several examples of how to use Stan in R with brms. It assumes basic knowledge of Bayesian inference and MCMC. The examples are related to the Bayesian Data Analysis course.
Toy data with a sequence of failures (0) and successes (1). We would like to learn about the unknown probability of success.
data_bern <- data.frame(y = c(1, 1, 1, 0, 1, 1, 1, 0, 1, 0))
As usual in case of generalized linear models (GLMs), brms defines the priors on the latent model parameters. With Bernoulli the default link function is logit, and thus the prior is set on logit(theta). As there are no covariates, logit(theta)=Intercept. The brms default prior for Intercept is student_t(3, 0, 2.5), but we use student_t(7, 0, 1.5), which is close to the logistic distribution and thus makes the prior near-uniform for theta. We also compare the result to using a normal(0, 1.5) prior on the logit probability. We visualize the implied priors on theta by sampling from the priors.
data.frame(theta = plogis(ggdist::rstudent_t(n=20000, df=3, mu=0, sigma=2.5))) |>
mcmc_hist() +
xlim(c(0,1)) +
labs(title='Default brms student_t(3, 0, 2.5) prior on Intercept')
data.frame(theta = plogis(ggdist::rstudent_t(n=20000, df=7, mu=0, sigma=1.5))) |>
mcmc_hist() +
xlim(c(0,1)) +
labs(title='student_t(7, 0, 1.5) prior on Intercept')
An almost uniform prior on theta can also be obtained with normal(0, 1.5).
data.frame(theta = plogis(rnorm(n=20000, mean=0, sd=1.5))) |>
mcmc_hist() +
xlim(c(0,1)) +
labs(title='normal(0, 1.5) prior on Intercept')
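The histograms above can be complemented with a quick numerical check. The following is a minimal sketch in base R, not part of the original analysis: it uses the fact that a uniform prior on theta has quantile function q(p) = p, and reports how far the implied quantiles of theta deviate from that under each of the three priors. The grid `p` and the name `max_dev` are chosen here for illustration.

```r
# Probability grid on which to compare quantiles (avoids the endpoints 0 and 1)
p <- seq(0.01, 0.99, by = 0.01)

# Implied quantiles of theta = plogis(Intercept) under each prior.
# For a location-scale prior, the Intercept quantile is mu + sigma * q(p),
# and plogis is monotone, so quantiles can be transformed directly.
q_default <- plogis(0 + 2.5 * qt(p, df = 3))   # student_t(3, 0, 2.5)
q_t7      <- plogis(0 + 1.5 * qt(p, df = 7))   # student_t(7, 0, 1.5)
q_normal  <- plogis(0 + 1.5 * qnorm(p))        # normal(0, 1.5)

# A uniform prior on theta has quantile function q(p) = p,
# so the largest absolute deviation from p measures non-uniformity.
max_dev <- c(default = max(abs(q_default - p)),
             t7      = max(abs(q_t7 - p)),
             normal  = max(abs(q_normal - p)))
print(round(max_dev, 3))
```

A small deviation means the implied prior on theta is close to uniform; a large one means the prior piles mass near 0 and 1.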
Formula y ~ 1 corresponds to a model \(\mathrm{logit}(\theta) = \alpha\times 1 = \alpha\). brms denotes the \(\alpha\) as Intercept.
fit_bern <- brm(y ~ 1, family = bernoulli(), data = data_bern,
prior = prior(student_t(7, 0, 1.5), class='Intercept'),
seed = SEED, refresh = 0)
Check the summary of the posterior and inference diagnostics.
fit_bern
Family: bernoulli
Links: mu = logit
Formula: y ~ 1
Data: data_bern (Number of observations: 10)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.76 0.64 -0.43 2.09 1.00 1734 1726
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Extract the posterior draws
draws <- as_draws_df(fit_bern)
We can get summary information using summarise_draws()
draws |>
subset_draws(variable='b_Intercept') |>
summarise_draws()
# A tibble: 1 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 b_Intercept 0.763 0.746 0.641 0.636 -0.242 1.90 1.00 1734. 1726.
We can compute the probability of success by using plogis, which is the inverse-logit function.
draws <- draws |>
mutate_variables(theta=plogis(b_Intercept))
Summary of theta by using summarise_draws()
draws |>
subset_draws(variable='theta') |>
summarise_draws()
# A tibble: 1 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 theta 0.668 0.678 0.130 0.134 0.440 0.870 1.00 1734. 1726.
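As a rough sanity check on the MCMC summary, the posterior can be compared with a closed-form result. This is a sketch under a stated assumption: it replaces the student_t(7, 0, 1.5) prior on the logit scale with an exactly uniform Beta(1, 1) prior on theta, for which the posterior is a Beta distribution by conjugacy. Because the prior used in the fit is only near-uniform, the numbers should be close to, but not identical with, the summary above.

```r
y <- c(1, 1, 1, 0, 1, 1, 1, 0, 1, 0)   # same toy data as data_bern$y

# Assumption: exactly uniform Beta(1, 1) prior on theta
# (this is NOT the prior used in fit_bern, only an approximation to it)
a_post <- 1 + sum(y)        # prior a + number of successes
b_post <- 1 + sum(1 - y)    # prior b + number of failures

post_mean <- a_post / (a_post + b_post)
post_q    <- qbeta(c(0.05, 0.5, 0.95), a_post, b_post)

print(c(mean = post_mean, q5 = post_q[1], median = post_q[2], q95 = post_q[3]))
```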
Histogram of theta
mcmc_hist(draws, pars='theta') +
xlab('theta') +
xlim(c(0,1))
The prior and likelihood sensitivity plot shows the posterior density estimate depending on the amount of power-scaling. Overlapping lines indicate low sensitivity and wider gaps between lines indicate greater sensitivity.
theta <- draws |>
subset_draws(variable='theta')
powerscale_sequence(fit_bern, prediction = \(x, ...) theta) |>
powerscale_plot_dens(variables='theta') +
# switch rows and cols
facet_grid(rows=vars(.data$variable),
cols=vars(.data$component)) +
# cleaning
ggtitle(NULL,NULL) +
labs(x='theta', y=NULL) +
scale_y_continuous(breaks=NULL) +
theme(axis.line.y=element_blank(),
strip.text.y=element_blank()) +
xlim(c(0,1))
We can summarise the prior and likelihood sensitivity using cumulative Jensen-Shannon distance.
powerscale_sensitivity(fit_bern, prediction = \(x, ...) theta)$sensitivity |>
filter(variable=='theta') |>
mutate(across(where(is.double), ~num(.x, digits=2)))
# A tibble: 1 × 4
variable prior likelihood diagnosis
<chr> <num:.2!> <num:.2!> <chr>
1 theta 0.04 0.11 -
Instead of a sequence of 0’s and 1’s, we can summarize the data with the number of trials and the number of successes and use a Binomial model. The prior is specified in the ‘latent space’. The actual probability of success is theta = plogis(alpha), where plogis is the logistic function, that is, the inverse of the logit function.
Binomial model with the same data and prior
data_bin <- data.frame(N = c(10), y = c(7))
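The reason the aggregated data carry the same information about theta is that the Bernoulli and Binomial likelihoods differ only by a factor that does not depend on theta. A minimal base R sketch of this (the grid `theta` and the variable names are chosen here for illustration):

```r
y <- c(1, 1, 1, 0, 1, 1, 1, 0, 1, 0)   # individual trials
theta <- seq(0.05, 0.95, by = 0.05)    # grid of candidate success probabilities

# Log-likelihood from the individual 0/1 outcomes
loglik_bern <- sapply(theta, \(th) sum(dbinom(y, size = 1, prob = th, log = TRUE)))
# Log-likelihood from the aggregated count
loglik_bin <- dbinom(sum(y), size = length(y), prob = theta, log = TRUE)

# The difference is the same constant, log(choose(10, 7)), for every theta
print(range(loglik_bin - loglik_bern))
print(lchoose(length(y), sum(y)))
```

Since the two log-likelihoods differ by a constant, the posteriors given the same prior are the same, which is why the two fits agree up to Monte Carlo error.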
Formula y | trials(N) ~ 1 corresponds to a model \(\mathrm{logit}(\theta) = \alpha\), and the number of trials for each observation is provided by | trials(N)
fit_bin <- brm(y | trials(N) ~ 1, family = binomial(), data = data_bin,
prior = prior(student_t(7, 0,1.5), class='Intercept'),
seed = SEED, refresh = 0)
Check the summary of the posterior and inference diagnostics.
fit_bin
Family: binomial
Links: mu = logit
Formula: y | trials(N) ~ 1
Data: data_bin (Number of observations: 1)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.75 0.64 -0.47 2.07 1.00 1699 1508
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
The diagnostic indicates prior-data conflict, that is, both prior and likelihood are informative. If there is true strong prior information that would justify the normal(0,1) prior, then this is fine, but otherwise more thinking is required (the goal is not to adjust the prior to remove diagnostic warnings without thinking). In this toy example, we proceed with this prior.
Extract the posterior draws
draws <- as_draws_df(fit_bin)
We can get summary information using summarise_draws()
draws |>
subset_draws(variable='b_Intercept') |>
summarise_draws()
# A tibble: 1 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 b_Intercept 0.749 0.741 0.635 0.620 -0.266 1.87 1.00 1699. 1508.
We can compute the probability of success by using plogis, which is the inverse-logit function.
draws <- draws |>
mutate_variables(theta=plogis(b_Intercept))
Summary of theta by using summarise_draws()
draws |>
subset_draws(variable='theta') |>
summarise_draws()
# A tibble: 1 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 theta 0.665 0.677 0.130 0.131 0.434 0.866 1.00 1699. 1508.
Histogram of theta
mcmc_hist(draws, pars='theta') +
xlab('theta') +
xlim(c(0,1))
Re-run the model with a new dataset without recompiling
data_bin <- data.frame(N = c(5), y = c(4))
fit_bin <- update(fit_bin, newdata = data_bin)
Check the summary of the posterior and inference diagnostics.
fit_bin
Family: binomial
Links: mu = logit
Formula: y | trials(N) ~ 1
Data: data_bin (Number of observations: 1)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 1.06 0.90 -0.57 3.01 1.00 1484 1593
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Extract the posterior draws
draws <- as_draws_df(fit_bin)
We can get summary information using summarise_draws()
draws |>
subset_draws(variable='b_Intercept') |>
summarise_draws()
# A tibble: 1 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 b_Intercept 1.06 0.990 0.903 0.896 -0.306 2.60 1.00 1484. 1593.
We can compute the probability of success by using plogis, which is the inverse-logit function.
draws <- draws |>
mutate_variables(theta=plogis(b_Intercept))
Summary of theta by using summarise_draws()
draws |>
subset_draws(variable='theta') |>
summarise_draws()
# A tibble: 1 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 theta 0.712 0.729 0.158 0.171 0.424 0.931 1.00 1484. 1593.
Histogram of theta
mcmc_hist(draws, pars='theta') +
xlab('theta') +
xlim(c(0,1))
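An experiment was performed to estimate the effect of beta-blockers on mortality of cardiac patients. A group of patients was randomly assigned to treatment and control groups: out of 674 patients in the control group, 39 died, and out of 680 patients in the treatment group, 22 died.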
Data, where grp2 is an indicator variable defined as a factor type, which is useful for categorical variables.
data_bin2 <- data.frame(N = c(674, 680),
y = c(39,22),
grp2 = factor(c('control','treatment')))
To analyse whether the treatment is useful, we can use a Binomial model for both groups and compute the odds-ratio. To recreate the model as two independent (separate) binomial models, we use formula y | trials(N) ~ 0 + grp2, which corresponds to a model \(\mathrm{logit}(\theta) = \alpha \times 0 + \beta_\mathrm{control}\times x_\mathrm{control} + \beta_\mathrm{treatment}\times x_\mathrm{treatment} = \beta_\mathrm{control}\times x_\mathrm{control} + \beta_\mathrm{treatment}\times x_\mathrm{treatment}\), where \(x_\mathrm{control}\) is a vector with 1 for control and 0 for treatment, and \(x_\mathrm{treatment}\) is a vector with 1 for treatment and 0 for control. As only one of the vectors has 1 for each observation, this corresponds to separate models \(\mathrm{logit}(\theta_\mathrm{control}) = \beta_\mathrm{control}\) and \(\mathrm{logit}(\theta_\mathrm{treatment}) = \beta_\mathrm{treatment}\). We can provide the same prior for all \(\beta\)’s by setting the prior with class='b'. With prior student_t(7, 0, 1.5), both \(\beta\)’s are shrunk towards 0, but independently.
fit_bin2 <- brm(y | trials(N) ~ 0 + grp2, family = binomial(), data = data_bin2,
prior = prior(student_t(7, 0,1.5), class='b'),
seed = SEED, refresh = 0)
Check the summary of the posterior and inference diagnostics. With ~ 0 + grp2 there is no Intercept, and \(\beta_\mathrm{control}\) and \(\beta_\mathrm{treatment}\) are presented as grp2control and grp2treatment.
fit_bin2
Family: binomial
Links: mu = logit
Formula: y | trials(N) ~ 0 + grp2
Data: data_bin2 (Number of observations: 2)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
grp2control -2.78 0.17 -3.13 -2.45 1.00 2406 2661
grp2treatment -3.38 0.21 -3.80 -2.99 1.00 3386 2390
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Compute theta for each group and the odds-ratio. brms uses variable names b_grp2control and b_grp2treatment for \(\beta_\mathrm{control}\) and \(\beta_\mathrm{treatment}\) respectively.
draws_bin2 <- as_draws_df(fit_bin2) |>
mutate(theta_control = plogis(b_grp2control),
theta_treatment = plogis(b_grp2treatment),
oddsratio = (theta_treatment/(1-theta_treatment))/(theta_control/(1-theta_control)))
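Because the coefficients are log-odds, the odds-ratio computed above via theta is the same quantity as exp(b_grp2treatment - b_grp2control). The sketch below illustrates this identity in base R. The draws are hypothetical stand-ins (normal approximations built from the posterior means and standard errors in the summary), not the actual posterior draws from fit_bin2, and the helper `odds` is defined here for illustration.

```r
# Hypothetical stand-ins for posterior draws of the two coefficients
# (assumption: normal approximations using the rounded summary values)
set.seed(1)
b_control   <- rnorm(1000, mean = -2.78, sd = 0.17)
b_treatment <- rnorm(1000, mean = -3.38, sd = 0.21)

odds <- \(theta) theta / (1 - theta)

# Odds-ratio via the probabilities, as in the code above
or_via_theta <- odds(plogis(b_treatment)) / odds(plogis(b_control))
# Odds-ratio directly from the difference of the log-odds
or_via_logit <- exp(b_treatment - b_control)

print(all.equal(or_via_theta, or_via_logit))

# Raw odds-ratio from the observed counts, for comparison
print((22 / (680 - 22)) / (39 / (674 - 39)))
```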
Plot oddsratio
mcmc_hist(draws_bin2, pars='oddsratio') +
scale_x_continuous(breaks=seq(0.2,1.6,by=0.2))+
geom_vline(xintercept=1, linetype='dashed')
Probability that the oddsratio<1
draws_bin2 |>
mutate(poddsratio = oddsratio<1) |>
subset(variable='poddsratio') |>
summarise_draws(mean, mcse_mean)
# A tibble: 1 × 3
variable mean mcse_mean
<chr> <dbl> <dbl>
1 poddsratio 0.984 0.00234
oddsratio 95% posterior interval
draws_bin2 |>
subset(variable='oddsratio') |>
summarise_draws(~quantile(.x, probs = c(0.025, 0.975)), ~mcse_quantile(.x, probs = c(0.025, 0.975)))
# A tibble: 1 × 5
variable `2.5%` `97.5%` mcse_q2.5 mcse_q97.5
<chr> <dbl> <dbl> <dbl> <dbl>
1 oddsratio 0.320 0.928 0.00381 0.0149
Make prior sensitivity analysis by power-scaling both prior and likelihood. Focus on oddsratio which is the quantity of interest. We see that the likelihood is much more informative than the prior, and we would expect to see a different posterior only with a highly informative prior (possibly based on previous similar experiments).
oddsratio <- draws_bin2 |>
subset_draws(variable='oddsratio')
The prior and likelihood sensitivity plot shows the posterior density estimate depending on the amount of power-scaling. Overlapping lines indicate low sensitivity and wider gaps between lines indicate greater sensitivity.
powerscale_sequence(fit_bin2, prediction = \(x, ...) oddsratio) |>
powerscale_plot_dens(variables='oddsratio') +
# switch rows and cols
facet_grid(rows=vars(.data$variable),
cols=vars(.data$component)) +
# cleaning
ggtitle(NULL,NULL) +
labs(x='Odds-ratio', y=NULL) +
scale_y_continuous(breaks=NULL) +
theme(axis.line.y=element_blank(),
strip.text.y=element_blank()) +
# reference line
geom_vline(xintercept=1, linetype='dashed')
We can summarise the prior and likelihood sensitivity using cumulative Jensen-Shannon distance.
powerscale_sensitivity(fit_bin2, prediction = \(x, ...) oddsratio, num_args=list(digits=2)
)$sensitivity |>
filter(variable=='oddsratio') |>
mutate(across(where(is.double), ~num(.x, digits=2)))
# A tibble: 1 × 4
variable prior likelihood diagnosis
<chr> <num:.2!> <num:.2!> <chr>
1 oddsratio 0.01 0.14 -
Above we used formula y | trials(N) ~ 0 + grp2 to have separate models for the control and treatment groups. An alternative model y | trials(N) ~ grp2, which is equal to y | trials(N) ~ 1 + grp2, would correspond to a model \(\mathrm{logit}(\theta) = \alpha \times 1 + \beta_\mathrm{treatment}\times x_\mathrm{treatment} = \alpha + \beta_\mathrm{treatment}\times x_\mathrm{treatment}\). Now \(\alpha\) models the probability of death (via logistic link) in the control group and \(\alpha + \beta_\mathrm{treatment}\) models the probability of death (via logistic link) in the treatment group. Now the models for the groups are connected. Furthermore, if we set independent student_t(7, 0, 1.5) priors on \(\alpha\) and \(\beta_\mathrm{treatment}\), the implied priors on \(\theta_\mathrm{control}\) and \(\theta_\mathrm{treatment}\) are different. We can verify this with a prior simulation.
data.frame(theta_control = plogis(ggdist::rstudent_t(n=20000, df=7, mu=0, sigma=1.5))) |>
mcmc_hist() +
xlim(c(0,1)) +
labs(title='student_t(7, 0, 1.5) prior on Intercept') +
data.frame(theta_treatment = plogis(ggdist::rstudent_t(n=20000, df=7, mu=0, sigma=1.5) +
ggdist::rstudent_t(n=20000, df=7, mu=0, sigma=1.5))) |>
mcmc_hist() +
xlim(c(0,1)) +
labs(title='student_t(7, 0, 1.5) prior on Intercept and b_grp2treatment')
In this case, with relatively big treatment and control group, the likelihood is informative, and the difference between using y | trials(N) ~ 0 + grp2 or y | trials(N) ~ grp2 is negligible.
A third option would be a hierarchical model with formula y | trials(N) ~ (1 | grp2), which is equivalent to y | trials(N) ~ 1 + (1 | grp2), and corresponds to a model \(\mathrm{logit}(\theta) = \alpha \times 1 + \beta_\mathrm{control}\times x_\mathrm{control} + \beta_\mathrm{treatment}\times x_\mathrm{treatment}\), but now the prior on \(\beta_\mathrm{control}\) and \(\beta_\mathrm{treatment}\) is \(\mathrm{normal}(0, \sigma_\mathrm{grp})\). The default brms prior for \(\sigma_\mathrm{grp}\) is student_t(3, 0, 2.5). Now \(\alpha\) models the overall probability of death (via logistic link), and \(\beta_\mathrm{control}\) and \(\beta_\mathrm{treatment}\) model the differences from that, sharing the same prior. The prior for \(\beta_\mathrm{control}\) and \(\beta_\mathrm{treatment}\) includes the unknown scale \(\sigma_\mathrm{grp}\). If there is no difference between the control and treatment groups, the posterior of \(\sigma_\mathrm{grp}\) has more mass near 0, and the bigger the difference between the groups is, the more mass there is away from 0. With just two groups, there is not much information about \(\sigma_\mathrm{grp}\), and unless there is an informative prior on \(\sigma_\mathrm{grp}\), a two-group hierarchical model is not that useful. Hierarchical models are more useful with more than two groups. In the following, we use the previously used student_t(7, 0, 1.5) prior on the intercept and the default brms prior student_t(3, 0, 2.5) on \(\sigma_\mathrm{grp}\).
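Written out in full, the hierarchical model described above can be sketched as follows. The index \(j\) and the superscript \(+\) are notation introduced here; the \(+\) marks that the prior on \(\sigma_\mathrm{grp}\) is restricted to non-negative values, since brms constrains standard deviation parameters to be non-negative.

```latex
\begin{aligned}
y_j &\sim \mathrm{Binomial}(N_j, \theta_j), \quad j \in \{\mathrm{control}, \mathrm{treatment}\},\\
\mathrm{logit}(\theta_j) &= \alpha + \beta_j,\\
\beta_j &\sim \mathrm{normal}(0, \sigma_\mathrm{grp}),\\
\alpha &\sim \mathrm{student\_t}(7, 0, 1.5),\\
\sigma_\mathrm{grp} &\sim \mathrm{student\_t}^{+}(3, 0, 2.5).
\end{aligned}
```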
fit_bin2 <- brm(y | trials(N) ~ 1 + (1 | grp2), family = binomial(), data = data_bin2,
prior = prior(student_t(7, 0,1.5), class='Intercept'),
seed = SEED, refresh = 0, control=list(adapt_delta=0.99))
Check the summary of the posterior and inference diagnostics. The summary reports that there are Group-Level Effects: ~grp2 with 2 levels (control and treatment), with sd(Intercept) denoting \(\sigma_\mathrm{grp}\). In addition, the summary lists Population-Level Effects: Intercept (\(\alpha\)) as in the previous non-hierarchical models.
fit_bin2
Family: binomial
Links: mu = logit
Formula: y | trials(N) ~ 1 + (1 | grp2)
Data: data_bin2 (Number of observations: 2)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Group-Level Effects:
~grp2 (Number of levels: 2)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 1.64 1.45 0.15 5.69 1.00 549 932
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -2.21 1.24 -3.89 0.78 1.00 611 752
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
We can also look at the variable names brms uses internally
as_draws_rvars(fit_bin2)
# A draws_rvars: 1000 iterations, 4 chains, and 5 variables
$b_Intercept: rvar<1000,4>[1] mean ± sd:
[1] -2.2 ± 1.2
$sd_grp2__Intercept: rvar<1000,4>[1] mean ± sd:
[1] 1.6 ± 1.4
$r_grp2: rvar<1000,4>[2,1] mean ± sd:
Intercept
control -0.6 ± 1.2
treatment -1.2 ± 1.3
$lprior: rvar<1000,4>[1] mean ± sd:
[1] -4.2 ± 0.71
$lp__: rvar<1000,4>[1] mean ± sd:
[1] -13 ± 1.8
Although the result does not differ from the earlier one, we illustrate how to compute the oddsratio from the hierarchical model
draws_bin2 <- as_draws_df(fit_bin2)
oddsratio <- draws_bin2 |>
mutate_variables(theta_control = plogis(b_Intercept + `r_grp2[control,Intercept]`),
theta_treatment = plogis(b_Intercept + `r_grp2[treatment,Intercept]`),
oddsratio = (theta_treatment/(1-theta_treatment))/(theta_control/(1-theta_control))) |>
subset_draws(variable='oddsratio')
oddsratio |> mcmc_hist() +
scale_x_continuous(breaks=seq(0.2,1.6,by=0.2))+
geom_vline(xintercept=1, linetype='dashed')
Make also prior sensitivity analysis with focus on oddsratio.
powerscale_sensitivity(fit_bin2, prediction = \(x, ...) oddsratio, num_args=list(digits=2)
)$sensitivity |>
filter(variable=='oddsratio') |>
mutate(across(where(is.double), ~num(.x, digits=2)))
# A tibble: 1 × 4
variable prior likelihood diagnosis
<chr> <num:.2!> <num:.2!> <chr>
1 oddsratio 0.01 0.14 -
Use the Kilpisjärvi summer month temperatures 1952–2022 data from the aaltobda package
load(url('https://github.com/avehtari/BDA_course_Aalto/raw/master/rpackage/data/kilpisjarvi2022.rda'))
data_lin <- data.frame(year = kilpisjarvi2022$year,
temp = kilpisjarvi2022$temp.summer)
Plot the data
data_lin |>
ggplot(aes(year, temp)) +
geom_point(color=2) +
labs(x= "Year", y = 'Summer temp. @Kilpisjärvi') +
guides(linetype = "none")
To analyse whether there has been a change in the average summer month temperature, we use a linear model with a Gaussian model for the unexplained variation. By default brms uses a uniform prior for the coefficients.
Formula temp ~ year corresponds to model \(\mathrm{temp} \sim \mathrm{normal}(\alpha + \beta \times \mathrm{year}, \sigma)\). The model could also be defined as temp ~ 1 + year, which explicitly shows the intercept (\(\alpha\)) part. Using the variable names brms uses, the model can also be written as temp ~ normal(b_Intercept*1 + b_year*year, sigma). We start with the default priors to see some tricks that brms does behind the curtain.
fit_lin <- brm(temp ~ year, data = data_lin, family = gaussian(),
seed = SEED, refresh = 0)
Check the summary of the posterior and inference diagnostics.
fit_lin
Family: gaussian
Links: mu = identity; sigma = identity
Formula: temp ~ year
Data: data_lin (Number of observations: 71)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -34.69 12.49 -58.73 -10.19 1.00 3995 3035
year 0.02 0.01 0.01 0.03 1.00 3996 3035
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 1.08 0.09 0.91 1.28 1.00 3057 3011
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Convergence diagnostics look good. We see that posterior mean of Intercept is -34.7, which may sound strange, but that is the intercept at year 0, that is, very far from the data range, and thus doesn’t have meaningful interpretation directly. The posterior mean of year coefficient is 0.02, that is, we estimate that the summer temperature is increasing 0.02°C per year (which would make 1°C in 50 years).
We can check \(R^2\), which corresponds to the proportion of variance explained by the model. The linear model explains 0.16=16% of the total data variance.
bayes_R2(fit_lin) |> round(2)
Estimate Est.Error Q2.5 Q97.5
R2 0.16 0.07 0.03 0.3
We can check all the priors used.
prior_summary(fit_lin)
prior class coef group resp dpar nlpar lb ub source
(flat) b default
(flat) b year (vectorized)
student_t(3, 9.5, 2.5) Intercept default
student_t(3, 0, 2.5) sigma 0 default
We see that class=b and coef=year have flat, that is, improper uniform priors, Intercept has student_t(3, 9.5, 2.5), and sigma has student_t(3, 0, 2.5) prior. In general it is good to use proper priors, but sometimes flat priors are fine and produce a proper posterior (like in this case). The important part here is that by default, brms sets the prior on Intercept after centering the covariate values (design matrix). In this case, brms uses year - mean(year) = year - 1987 instead of the original years. This in general improves the sampling efficiency. As the Intercept is now defined at the middle of the data, the default Intercept prior is centered on the median of the target (here the target is temp). If we would like to set informative priors, we need to set the informative prior on Intercept given the centered covariate values. We can turn off the centering by setting argument center=FALSE, and we can set the prior on the original intercept by using a formula temp ~ 0 + Intercept + year. In this case, we are happy with the default prior for the intercept. In this specific case, the flat prior on the coefficient is also fine, but we add a weakly informative prior just for illustration. Let’s assume we expect the temperature to change less than 1°C in 10 years. With student_t(3, 0, 0.03) about 95% of the prior mass has less than 0.1°C change per year, and with low degrees of freedom (3) we have thick tails, making the likelihood dominate in case of prior-data conflict. In real life, we do have much more information about the temperature change, and naturally a hierarchical spatio-temporal model with all temperature measurement locations would be even better.
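The two numerical statements in the paragraph above can be checked directly in base R. This is a small sketch, assuming the data contain every year from 1952 to 2022 (consistent with the 71 observations reported in the summary); `prior_scale` and `mass` are names chosen here for illustration.

```r
# Prior mass of student_t(3, 0, 0.03) on |slope| < 0.1 degrees C per year
prior_scale <- 0.03
mass <- pt(0.1 / prior_scale, df = 3) - pt(-0.1 / prior_scale, df = 3)
print(mass)

# Central 95% prior interval for the slope
print(prior_scale * qt(c(0.025, 0.975), df = 3))

# Centering value for the covariate, assuming years 1952 to 2022 without gaps
print(mean(1952:2022))
```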
fit_lin <- brm(temp ~ year, data = data_lin, family = gaussian(),
prior = prior(student_t(3, 0, 0.03), class='b'),
seed = SEED, refresh = 0)
Check the summary of the posterior and inference diagnostics.
fit_lin
Family: gaussian
Links: mu = identity; sigma = identity
Formula: temp ~ year
Data: data_lin (Number of observations: 71)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -32.54 12.28 -56.70 -9.01 1.00 4183 3259
year 0.02 0.01 0.01 0.03 1.00 4182 3259
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 1.08 0.09 0.92 1.27 1.00 3494 2709
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Make prior sensitivity analysis by power-scaling both prior and likelihood.
powerscale_sensitivity(fit_lin)$sensitivity |>
mutate(across(where(is.double), ~num(.x, digits=2)))
# A tibble: 3 × 4
variable prior likelihood diagnosis
<chr> <num:.2!> <num:.2!> <chr>
1 b_Intercept 0.03 0.09 -
2 b_year 0.03 0.09 -
3 sigma 0.00 0.13 -
Our weakly informative proper prior has negligible sensitivity, and the likelihood is informative. Extract the posterior draws and check the summaries
draws_lin <- as_draws_df(fit_lin)
draws_lin |> summarise_draws()
# A tibble: 5 × 10
variable mean median sd mad q5 q95 rhat ess_bulk
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 b_Intercept -3.25e+1 -3.24e+1 1.23e+1 1.24e+1 -5.29e+1 -1.29e+1 1.00 4183.
2 b_year 2.11e-2 2.11e-2 6.18e-3 6.22e-3 1.12e-2 3.14e-2 1.00 4182.
3 sigma 1.08e+0 1.07e+0 9.14e-2 9.08e-2 9.43e-1 1.24e+0 1.00 3494.
4 lprior -1.08e+0 -1.06e+0 1.65e-1 1.65e-1 -1.38e+0 -8.51e-1 1.00 4173.
5 lp__ -1.07e+2 -1.06e+2 1.21e+0 9.72e-1 -1.09e+2 -1.05e+2 1.00 1899.
# ℹ 1 more variable: ess_tail <dbl>
If one of the columns is hidden we can force printing all columns
draws_lin |> summarise_draws() |> print(width=Inf)
# A tibble: 5 × 10
variable mean median sd mad q5 q95 rhat
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 b_Intercept -32.5 -32.4 12.3 12.4 -52.9 -12.9 1.00
2 b_year 0.0211 0.0211 0.00618 0.00622 0.0112 0.0314 1.00
3 sigma 1.08 1.07 0.0914 0.0908 0.943 1.24 1.00
4 lprior -1.08 -1.06 0.165 0.165 -1.38 -0.851 1.00
5 lp__ -107. -106. 1.21 0.972 -109. -105. 1.00
ess_bulk ess_tail
<dbl> <dbl>
1 4183. 3259.
2 4182. 3259.
3 3494. 2709.
4 4173. 3285.
5 1899. 2576.
Histogram of b_year
draws_lin |>
mcmc_hist(pars='b_year') +
xlab('Average temperature increase per year')
Probability that the coefficient b_year > 0 and the corresponding MCSE
draws_lin |>
mutate(I_b_year_gt_0 = b_year>0) |>
subset_draws(variable='I_b_year_gt_0') |>
summarise_draws(mean, mcse_mean)
# A tibble: 1 × 3
variable mean mcse_mean
<chr> <dbl> <dbl>
1 I_b_year_gt_0 1 NA
All posterior draws have b_year>0, the probability gets rounded to 1, and MCSE is not available as the observed posterior variance is 0.
95% posterior interval for temperature increase per 100 years
draws_lin |>
mutate(b_year_100 = b_year*100) |>
subset_draws(variable='b_year_100') |>
summarise_draws(~quantile(.x, probs = c(0.025, 0.975)),
~mcse_quantile(.x, probs = c(0.025, 0.975)),
.num_args = list(digits = 2, notation = "dec"))
# A tibble: 1 × 5
variable `2.5%` `97.5%` mcse_q2.5 mcse_q97.5
<chr> <dbl> <dbl> <dbl> <dbl>
1 b_year_100 0.93 3.33 0.03 0.03
Plot posterior draws of the linear function values at each year. add_linpred_draws() takes the years from the data and uses fit_lin to make the predictions.
data_lin |>
add_linpred_draws(fit_lin) |>
# plot data
ggplot(aes(x=year, y=temp)) +
geom_point(color=2) +
# plot lineribbon for the linear model
stat_lineribbon(aes(y = .linpred), .width = c(.95), alpha = 1/2, color=brewer.pal(5, "Blues")[[5]]) +
# decoration
scale_fill_brewer()+
labs(x= "Year", y = 'Summer temp. @Kilpisjärvi') +
theme(legend.position="none")+
scale_x_continuous(breaks=seq(1950,2020,by=10))
Alternatively, plot a spaghetti plot for 100 draws
data_lin |>
add_linpred_draws(fit_lin, ndraws=100) |>
# plot data
ggplot(aes(x=year, y=temp)) +
geom_point(color=2) +
# plot a line for each posterior draw
geom_line(aes(y=.linpred, group=.draw), alpha = 1/2, color = brewer.pal(5, "Blues")[[3]])+
# decoration
scale_fill_brewer()+
labs(x= "Year", y = 'Summer temp. @Kilpisjärvi') +
theme(legend.position="none")+
scale_x_continuous(breaks=seq(1950,2020,by=10))
Plot the posterior predictive distribution at each year until 2030. add_predicted_draws() takes the years from the data and uses fit_lin to make the predictions.
data_lin |>
add_row(year=2023:2030) |>
add_predicted_draws(fit_lin) |>
# plot data
ggplot(aes(x=year, y=temp)) +
geom_point(color=2) +
# plot lineribbon for the linear model
stat_lineribbon(aes(y = .prediction), .width = c(.95), alpha = 1/2, color=brewer.pal(5, "Blues")[[5]]) +
# decoration
scale_fill_brewer()+
labs(x= "Year", y = 'Summer temp. @Kilpisjärvi') +
theme(legend.position="none")+
scale_x_continuous(breaks=seq(1950,2030,by=10))
Warning: Removed 32000 rows containing missing values (`geom_point()`).
Posterior predictive check with density overlays examines the whole temperature distribution
pp_check(fit_lin, type='dens_overlay', ndraws=20)
The LOO-PIT check is good for checking whether the normal distribution describes the variation well, as it examines the calibration of LOO predictive distributions conditionally on each year. The LOO-PIT plot looks good.
pp_check(fit_lin, type='loo_pit_qq', ndraws=4000)
The temperatures used in the above analyses are averages over three months, which makes it more likely that they are normally distributed, but there can be extreme events in the weather, and we can check whether a more robust Student’s \(t\) observation model would give different results (although the LOO-PIT check already indicated that the normal would be good).
fit_lin_t <- brm(temp ~ year, data = data_lin, family = student(),
prior = prior(student_t(3, 0, 0.03), class='b'),
seed = SEED, refresh = 0)
Check the summary of the posterior and inference diagnostics. The b_year posterior looks similar as before and the posterior for degrees of freedom nu has most of the posterior mass for quite large values indicating there is no strong support for thick tailed variation in average summer temperatures.
fit_lin_t
Family: student
Links: mu = identity; sigma = identity; nu = identity
Formula: temp ~ year
Data: data_lin (Number of observations: 71)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -34.01 12.27 -58.50 -9.31 1.00 3979 2893
year 0.02 0.01 0.01 0.03 1.00 3979 2923
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 1.03 0.10 0.86 1.24 1.00 3209 2302
nu 24.54 14.36 6.36 60.80 1.00 2972 2325
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
We can use leave-one-out cross-validation to compare the expected predictive performance.
LOO comparison shows normal and Student’s \(t\) model have similar performance.
loo_compare(loo(fit_lin), loo(fit_lin_t))
elpd_diff se_diff
fit_lin 0.0 0.0
fit_lin_t -0.4 0.3
A heteroskedastic model allows the variation around the linear mean to vary as well. We can allow sigma to depend on year, too. Although the additional component is written as sigma ~ year, the log link function is used and the model is for log(sigma). bf() allows listing several formulas.
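In equation form, the model with year-dependent sigma can be sketched as follows. The symbols \(\alpha_\sigma\) and \(\beta_\sigma\) are notation introduced here for the coefficients that brms reports as sigma_Intercept and sigma_year.

```latex
\begin{aligned}
\mathrm{temp}_i &\sim \mathrm{normal}(\mu_i, \sigma_i),\\
\mu_i &= \alpha + \beta \times \mathrm{year}_i,\\
\log(\sigma_i) &= \alpha_\sigma + \beta_\sigma \times \mathrm{year}_i.
\end{aligned}
```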
fit_lin_h <- brm(bf(temp ~ year,
sigma ~ year),
data = data_lin, family = gaussian(),
prior = prior(student_t(3, 0, 0.03), class='b'),
seed = SEED, refresh = 0)
Check the summary of the posterior and inference diagnostics. The b_year posterior looks similar to before. The posterior for sigma_year has most of its mass at negative values, indicating a decrease in temperature variation around the mean.
fit_lin_h
Family: gaussian
Links: mu = identity; sigma = log
Formula: temp ~ year
sigma ~ year
Data: data_lin (Number of observations: 71)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -36.37 12.49 -61.25 -10.49 1.00 3412 2842
sigma_Intercept 19.10 8.69 1.56 35.80 1.00 3818 2899
year 0.02 0.01 0.01 0.04 1.00 3426 2885
sigma_year -0.01 0.00 -0.02 -0.00 1.00 3810 2855
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Posterior densities of b_year and b_sigma_year
as_draws_df(fit_lin_h) |>
mcmc_areas(pars=c('b_year', 'b_sigma_year'))
As exp(x) is approximately 1 + x when x is close to zero, the coefficient on log(sigma) can be read as a relative change: sigma is decreasing by about 1% per year (95% interval from 0% to 2%).
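The percentage reading of the sigma_year coefficient can be checked with a few lines of base R. This is a minimal sketch using the rounded values from the fit_lin_h summary above, so the results are only approximate.

```r
# Posterior summary of sigma_year copied from the fit_lin_h output above
# (rounded to two decimals there, so results here are approximate)
b_sigma_year <- c(estimate = -0.01, lower = -0.02, upper = 0.00)

# With the log link, sigma = exp(sigma_Intercept + sigma_year * year),
# so one additional year multiplies sigma by exp(sigma_year)
multiplier <- exp(b_sigma_year)

# Percentage change in sigma per year
pct_change <- 100 * (multiplier - 1)
print(round(pct_change, 2))
```

The exact percentages are very close to 100 times the coefficients themselves, which is what makes the shortcut reading reasonable for coefficients this small.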
Plot the posterior predictive distribution at each year until 2030. add_predicted_draws() takes the years from the data and uses fit_lin_h to make the predictions.
data_lin |>
add_row(year=2023:2030) |>
add_predicted_draws(fit_lin_h) |>
# plot data
ggplot(aes(x=year, y=temp)) +
geom_point(color=2) +
# plot lineribbon for the linear model
stat_lineribbon(aes(y = .prediction), .width = c(.95), alpha = 1/2, color=brewer.pal(5, "Blues")[[5]]) +
# decoration
scale_fill_brewer()+
labs(x= "Year", y = 'Summer temp. @Kilpisjärvi') +
theme(legend.position="none")+
scale_x_continuous(breaks=seq(1950,2030,by=10))
Warning: Removed 32000 rows containing missing values (`geom_point()`).
Perform prior sensitivity analysis by power-scaling both prior and likelihood.
powerscale_sensitivity(fit_lin_h)$sensitivity |>
mutate(across(where(is.double), ~num(.x, digits=2)))
# A tibble: 4 × 4
variable prior likelihood diagnosis
<chr> <num:.2!> <num:.2!> <chr>
1 b_Intercept 0.03 0.11 -
2 b_sigma_Intercept 0.00 0.10 -
3 b_year 0.03 0.11 -
4 b_sigma_year 0.00 0.11 -
We can use leave-one-out cross-validation to compare the expected predictive performance.
LOO comparison shows that the homoskedastic normal and heteroskedastic normal models have similar performance.
loo_compare(loo(fit_lin), loo(fit_lin_h))
elpd_diff se_diff
fit_lin_h 0.0 0.0
fit_lin -1.6 1.6
We can test the linearity assumption by using non-linear spline functions via s(year) terms. Sampling is slower as the posterior gets more complex.
fit_spline_h <- brm(bf(temp ~ s(year),
sigma ~ s(year)),
data = data_lin, family = gaussian(),
seed = SEED, refresh = 0)
We get warnings about divergences, and try rerunning with higher adapt_delta, which leads to using smaller step sizes. Often adapt_delta=0.999 leads to very slow sampling, but with this small data, this is not an issue.
fit_spline_h <- update(fit_spline_h, control = list(adapt_delta=0.999))
Check the summary of the posterior and inference diagnostics. We can no longer interpret the temperature increase directly from this summary. For splines, we see the prior scales sds for the spline coefficients.
fit_spline_h
Warning: There were 2 divergent transitions after warmup. Increasing
adapt_delta above 0.999 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Family: gaussian
Links: mu = identity; sigma = log
Formula: temp ~ s(year)
sigma ~ s(year)
Data: data_lin (Number of observations: 71)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Smooth Terms:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sds(syear_1) 1.05 1.03 0.03 3.82 1.00 1553 1839
sds(sigma_syear_1) 0.94 0.90 0.03 3.32 1.00 1233 1539
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 9.42 0.13 9.17 9.67 1.00 4196 3131
sigma_Intercept 0.04 0.09 -0.13 0.22 1.00 4866 2876
syear_1 2.93 2.65 -2.61 9.02 1.00 2011 1429
sigma_syear_1 -1.09 2.34 -6.15 3.79 1.00 1782 1182
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
We can still plot the posterior predictive distribution at each year until 2030. add_predicted_draws() takes the years from the data and uses fit_spline_h to make the predictions.
data_lin |>
add_row(year=2023:2030) |>
add_predicted_draws(fit_spline_h) |>
# plot data
ggplot(aes(x=year, y=temp)) +
geom_point(color=2) +
# plot lineribbon for the spline model
stat_lineribbon(aes(y = .prediction), .width = c(.95), alpha = 1/2, color=brewer.pal(5, "Blues")[[5]]) +
# decoration
scale_fill_brewer()+
labs(x= "Year", y = 'Summer temp. @Kilpisjärvi') +
theme(legend.position="none")+
scale_x_continuous(breaks=seq(1950,2030,by=10))
Warning: Removed 32000 rows containing missing values (`geom_point()`).
And we can use leave-one-out cross-validation to compare the expected predictive performance.
LOO comparison shows that the homoskedastic normal linear and heteroskedastic normal spline models have similar performance. There are not enough observations to clearly distinguish between the models.
loo_compare(loo(fit_lin), loo(fit_spline_h))
Warning: Found 1 observations with a pareto_k > 0.7 in model 'fit_spline_h'. It
is recommended to set 'moment_match = TRUE' in order to perform moment matching
for problematic observations.
elpd_diff se_diff
fit_spline_h 0.0 0.0
fit_lin -0.4 1.8
For spline and other non-parametric models, we can use predictive estimates and predictions to get interpretable quantities. Let’s examine the difference in estimated average temperature between the years 1952 and 2022.
temp_diff <- posterior_epred(fit_spline_h, newdata=filter(data_lin,year==1952|year==2022)) |>
rvar() |>
diff() |>
as_draws_df() |>
set_variables('temp_diff')
temp_diff <- data_lin |>
filter(year==1952|year==2022) |>
add_epred_draws(fit_spline_h) |>
pivot_wider(id_cols=.draw, names_from = year, values_from = .epred) |>
mutate(temp_diff = `2022`-`1952`,
.chain = (.draw - 1) %/% 1000 + 1,
.iteration = (.draw - 1) %% 1000 + 1) |>
as_draws_df() |>
subset_draws(variable='temp_diff')
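The second version above rebuilds .chain and .iteration from .draw with integer division and remainder. A minimal base R sketch of that arithmetic follows, under the assumption that there are 4 chains of 1000 post-warmup draws each and that draws are numbered chain by chain; the selected draw numbers are only examples.

```r
# Assumption: 4 chains with 1000 post-warmup draws each, numbered chain by chain
# (matching the 4000 post-warmup draws reported in the summaries above)
draws_per_chain <- 1000
draw <- c(1, 1000, 1001, 2500, 4000)  # example draw numbers

index <- data.frame(
  .draw = draw,
  # integer division picks the chain
  .chain = (draw - 1) %/% draws_per_chain + 1,
  # the remainder gives the position within the chain
  .iteration = (draw - 1) %% draws_per_chain + 1
)
print(index)
```

Subtracting 1 before dividing and adding 1 afterwards keeps both indices 1-based, so draw 1000 stays in chain 1 and draw 1001 starts chain 2.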
Posterior distribution for average summer temperature increase from 1952 to 2022
temp_diff |>
mcmc_hist()
95% posterior interval for average summer temperature increase from 1952 to 2022
temp_diff |>
summarise_draws(~quantile(.x, probs = c(0.025, 0.975)),
~mcse_quantile(.x, probs = c(0.025, 0.975)),
.num_args = list(digits = 2, notation = "dec"))
# A tibble: 1 × 5
variable `2.5%` `97.5%` mcse_q2.5 mcse_q97.5
<chr> <dbl> <dbl> <dbl> <dbl>
1 temp_diff 0.55 2.58 0.03 0.02
Perform prior sensitivity analysis by power-scaling both prior and likelihood, with focus on the average summer temperature increase from 1952 to 2022.
powerscale_sensitivity(fit_spline_h, prediction = \(x, ...) temp_diff, num_args=list(digits=2)
)$sensitivity |>
filter(variable=='temp_diff') |>
mutate(across(where(is.double), ~num(.x, digits=2)))
# A tibble: 1 × 4
variable prior likelihood diagnosis
<chr> <num:.2!> <num:.2!> <chr>
1 temp_diff 0.01 0.07 -
The probability that the average summer temperature has increased from 1952 to 2022 is 99.8%.
temp_diff |>
mutate(I_temp_diff_gt_0 = temp_diff>0,
temp_diff = NULL) |>
subset_draws(variable='I_temp_diff_gt_0') |>
summarise_draws(mean, mcse_mean)
# A tibble: 1 × 3
variable mean mcse_mean
<chr> <dbl> <dbl>
1 I_temp_diff_gt_0 0.998 0.000610
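The reported Monte Carlo standard error can be sanity-checked against the textbook formula for a proportion. The sketch below assumes independent draws, which MCMC draws are not; posterior::mcse_mean() accounts for autocorrelation through the effective sample size, so the two numbers need not match exactly.

```r
# Values copied from the summarise_draws() output above
p_hat <- 0.998  # posterior mean of the indicator I(temp_diff > 0)
S <- 4000       # total post-warmup draws

# Assumption: independent draws, so the MCSE of a proportion is
# sqrt(p * (1 - p) / S)
mcse_indep <- sqrt(p_hat * (1 - p_hat) / S)
print(signif(mcse_indep, 2))
```

This gives roughly 0.0007, the same order of magnitude as the 0.00061 reported above, so the third decimal of the probability is uncertain but the first two are not.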
Load factory data, which contain 5 quality measurements for each of 6 machines. We’re interested in analysing whether there are quality differences between the machines.
factory <- read.table(url('https://raw.githubusercontent.com/avehtari/BDA_course_Aalto/master/rpackage/data-raw/factory.txt'))
colnames(factory) <- 1:6
factory
1 2 3 4 5 6
1 83 117 101 105 79 57
2 92 109 93 119 97 92
3 92 114 92 116 103 104
4 46 104 86 102 79 77
5 67 87 67 116 92 100
We pivot the data to long format
factory <- factory |>
pivot_longer(cols = everything(),
names_to = 'machine',
values_to = 'quality')
factory
# A tibble: 30 × 2
machine quality
<chr> <int>
1 1 83
2 2 117
3 3 101
4 4 105
5 5 79
6 6 57
7 1 92
8 2 109
9 3 93
10 4 119
# ℹ 20 more rows
For comparison, we fit a pooled model.
fit_pooled <- brm(quality ~ 1, data = factory, refresh=0)
Check the summary of the posterior and inference diagnostics.
fit_pooled
Family: gaussian
Links: mu = identity; sigma = identity
Formula: quality ~ 1
Data: factory (Number of observations: 30)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 92.91 3.25 86.47 99.37 1.00 3014 2161
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 18.34 2.50 14.24 23.80 1.00 2898 2673
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
For comparison, we also fit a separate model. To make it completely separate, we need a different sigma for each machine, too.
fit_separate <- brm(bf(quality ~ 0 + machine,
sigma ~ 0 + machine),
data = factory, refresh=0)
Check the summary of the posterior and inference diagnostics.
fit_separate
Family: gaussian
Links: mu = identity; sigma = log
Formula: quality ~ 0 + machine
sigma ~ 0 + machine
Data: factory (Number of observations: 30)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
machine1 76.27 12.35 52.30 100.27 1.00 2073 1528
machine2 106.34 7.01 91.92 120.71 1.00 2492 1766
machine3 87.59 7.97 70.75 103.88 1.00 2124 1637
machine4 111.55 4.36 102.54 120.41 1.00 2353 2093
machine5 89.93 6.52 76.25 103.16 1.00 2913 1882
machine6 86.17 11.47 63.56 109.63 1.00 2673 1920
sigma_machine1 3.11 0.41 2.45 4.07 1.00 2429 1677
sigma_machine2 2.60 0.38 1.96 3.46 1.00 2683 2164
sigma_machine3 2.69 0.40 2.05 3.59 1.00 2109 1803
sigma_machine4 2.15 0.38 1.50 3.00 1.00 2721 1955
sigma_machine5 2.51 0.39 1.86 3.40 1.00 2651 2048
sigma_machine6 3.08 0.40 2.44 3.98 1.00 2764 1979
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
fit_hier <- brm(quality ~ 1 + (1 | machine),
data = factory, refresh = 0)
Check the summary of the posterior and inference diagnostics.
fit_hier
Family: gaussian
Links: mu = identity; sigma = identity
Formula: quality ~ 1 + (1 | machine)
Data: factory (Number of observations: 30)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Group-Level Effects:
~machine (Number of levels: 6)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 12.64 5.52 3.68 25.43 1.00 1241 1118
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 93.13 5.40 82.52 103.66 1.00 1353 1528
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 15.02 2.30 11.35 20.28 1.00 2239 2494
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
LOO comparison shows the hierarchical model is the best. The differences are small as the number of observations is small and there is considerable predictive (aleatoric) uncertainty.
loo_compare(loo(fit_pooled), loo(fit_separate), loo(fit_hier))
Warning: Found 3 observations with a pareto_k > 0.7 in model 'fit_separate'. It
is recommended to set 'moment_match = TRUE' in order to perform moment matching
for problematic observations.
elpd_diff se_diff
fit_hier 0.0 0.0
fit_separate -3.0 2.8
fit_pooled -4.0 2.0
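One rough way to read such a table is to relate each elpd_diff to its se_diff. The sketch below does this for the values printed above. It is only a heuristic: with 30 observations and differences this small, the standard errors themselves may be unreliable, so the ratios should not be treated as formal test statistics.

```r
# elpd_diff and se_diff copied from the loo_compare() output above
comparison <- data.frame(
  model = c("fit_hier", "fit_separate", "fit_pooled"),
  elpd_diff = c(0.0, -3.0, -4.0),
  se_diff = c(0.0, 2.8, 2.0)
)

# Ratio of the difference to its standard error; the reference model
# (first row) has both equal to zero, so its ratio is set to NA
comparison$ratio <- with(comparison,
                         ifelse(se_diff > 0, elpd_diff / se_diff, NA))
print(comparison)
```

Both ratios are modest in magnitude, which agrees with the statement that the differences between the models are small.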
Posterior distributions for the mean quality under the different models. The pooled model ignores the variation between machines. The separate model doesn’t benefit from the similarity of the machines and has higher uncertainty.
ph <- fit_hier |>
spread_rvars(b_Intercept, r_machine[machine,]) |>
mutate(machine_mean = b_Intercept + r_machine) |>
ggplot(aes(xdist=machine_mean, y=machine)) +
stat_halfeye() +
scale_y_continuous(breaks=1:6) +
labs(x='Quality', y='Machine', title='Hierarchical')
ps <- fit_separate |>
as_draws_df() |>
subset_draws(variable='b_machine', regex=TRUE) |>
set_variables(paste0('b_machine[', 1:6, ']')) |>
as_draws_rvars() |>
spread_rvars(b_machine[machine]) |>
mutate(machine_mean = b_machine) |>
ggplot(aes(xdist=machine_mean, y=machine)) +
stat_halfeye() +
scale_y_continuous(breaks=1:6) +
labs(x='Quality', y='Machine', title='Separate')
pp <- fit_pooled |>
spread_rvars(b_Intercept) |>
mutate(machine_mean = b_Intercept) |>
ggplot(aes(xdist=machine_mean, y=0)) +
stat_halfeye() +
scale_y_continuous(breaks=NULL) +
labs(x='Quality', y='All machines', title='Pooled')
(pp / ps / ph) * xlim(c(50,140))
Warning: Removed 646 rows containing missing values (`geom_slabinterval()`).
Perform prior sensitivity analysis by power-scaling both prior and likelihood, with focus on the mean quality of each machine. We see no prior sensitivity.
machine_mean <- fit_hier |>
as_draws_df() |>
mutate(across(matches('r_machine'), ~ .x + b_Intercept)) |>
subset_draws(variable='r_machine', regex=TRUE) |>
set_variables(paste0('machine_mean[', 1:6, ']'))
powerscale_sensitivity(fit_hier, prediction = \(x, ...) machine_mean, num_args=list(digits=2)
)$sensitivity |>
filter(str_detect(variable,'machine_mean')) |>
mutate(across(where(is.double), ~num(.x, digits=2)))
# A tibble: 6 × 4
variable prior likelihood diagnosis
<chr> <num:.2!> <num:.2!> <chr>
1 machine_mean[1] 0.02 0.10 -
2 machine_mean[2] 0.02 0.07 -
3 machine_mean[3] 0.02 0.04 -
4 machine_mean[4] 0.02 0.10 -
5 machine_mean[5] 0.02 0.03 -
6 machine_mean[6] 0.02 0.05 -
The Sorafenib Toxicity Dataset in the metadat R package includes results from 13 studies investigating the occurrence of dose-limiting toxicities (DLTs) at different doses of Sorafenib.
Load data
load(url('https://github.com/wviechtb/metadat/raw/master/data/dat.ursino2021.rda'))
head(dat.ursino2021)
study year dose events total
1 Awada 2005 100 0 4
2 Awada 2005 200 0 3
3 Awada 2005 300 1 5
4 Awada 2005 400 1 10
5 Awada 2005 600 7 12
6 Awada 2005 800 1 3
Number of patients per study
dat.ursino2021 |>
group_by(study) |>
summarise(N = sum(total)) |>
ggplot(aes(x=N, y=study)) +
geom_col(fill=4) +
labs(x='Number of patients per study', y='Study')
Distribution of doses
dat.ursino2021 |>
ggplot(aes(x=dose)) +
geom_histogram(breaks=seq(50,1050,by=100), fill=4, colour=1) +
labs(x='Dose (mg)', y='Count') +
scale_x_continuous(breaks=seq(100,1000,by=100))
Each study uses 2 to 6 different dose levels. The three studies that include only two dose levels are likely to provide weak information on the slope.
crosstab <- with(dat.ursino2021,table(dose,study))
data.frame(count=colSums(crosstab), study=colnames(crosstab)) |>
ggplot(aes(x=count, y=study)) +
geom_col(fill=4) +
labs(x='Number of dose levels per study', y='Study')
The pooled model assumes all studies have the same dose effect (reminder: ~ dose is equivalent to ~ 1 + dose).
fit_pooled <- brm(events | trials(total) ~ dose,
prior = c(prior(student_t(7, 0, 1.5), class='Intercept'),
prior(normal(0, 1), class='b')),
family=binomial(), data=dat.ursino2021)
Check the summary of the posterior and inference diagnostics.
fit_pooled
Family: binomial
Links: mu = logit
Formula: events | trials(total) ~ dose
Data: dat.ursino2021 (Number of observations: 49)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -3.18 0.39 -3.95 -2.44 1.00 1196 1997
dose 0.00 0.00 0.00 0.01 1.00 2389 2612
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Dose coefficient seems to be very small. Looking at the posterior, we see that it is positive with high probability.
fit_pooled |>
as_draws() |>
subset_draws(variable='b_dose') |>
summarise_draws(~quantile(.x, probs = c(0.025, 0.975)), ~mcse_quantile(.x, probs = c(0.025, 0.975)))
# A tibble: 1 × 5
variable `2.5%` `97.5%` mcse_q2.5 mcse_q97.5
<chr> <dbl> <dbl> <dbl> <dbl>
1 b_dose 0.00224 0.00524 0.0000459 0.0000467
The dose was reported in mg, and most values are in hundreds. It is often sensible to switch to a scale in which the range of values is closer to unit range. In this case it is natural to use g instead of mg.
dat.ursino2021 <- dat.ursino2021 |>
mutate(doseg = dose/1000)
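The effect of changing units on the slope can be illustrated without Stan. The sketch below uses only the six Awada rows printed by head() above and a maximum likelihood fit with base R glm(), which is not the brms model used in this notebook; the data frame name awada is introduced here for illustration.

```r
# The six rows printed by head(dat.ursino2021) above (study Awada)
awada <- data.frame(
  dose   = c(100, 200, 300, 400, 600, 800),
  events = c(0, 0, 1, 1, 7, 1),
  total  = c(4, 3, 5, 10, 12, 3)
)
awada$doseg <- awada$dose / 1000

# Maximum likelihood fits with base R glm(); NOT the brms model,
# only an illustration of how the slope changes with the unit of dose
fit_mg <- glm(cbind(events, total - events) ~ dose,
              family = binomial(), data = awada)
fit_g <- glm(cbind(events, total - events) ~ doseg,
             family = binomial(), data = awada)

# The slope per gram is 1000 times the slope per milligram
print(c(per_mg = unname(coef(fit_mg)["dose"]),
        per_g = unname(coef(fit_g)["doseg"])))
```

In a maximum likelihood fit the rescaling changes only the units of the coefficient. In the Bayesian fits the same normal(0, 1) prior is placed on the slope in both units, so it is more informative on the gram scale and the two posteriors need not differ by exactly a factor of 1000.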
Fit the pooled model again using doseg
fit_pooled <- brm(events | trials(total) ~ doseg,
prior = c(prior(student_t(7, 0, 1.5), class='Intercept'),
prior(normal(0, 1), class='b')),
family=binomial(), data=dat.ursino2021)
Check the summary of the posterior and inference diagnostics.
fit_pooled
Family: binomial
Links: mu = logit
Formula: events | trials(total) ~ doseg
Data: dat.ursino2021 (Number of observations: 49)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -2.58 0.29 -3.15 -2.03 1.00 2268 2299
doseg 2.41 0.59 1.27 3.52 1.00 2760 2908
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
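With dose in grams, the coefficients in the summary above can be turned into probabilities by hand. The following is a rough plug-in sketch using the posterior means only, so it ignores posterior uncertainty; the dose grid is an arbitrary choice for illustration.

```r
# Posterior means copied from the fit_pooled summary above (dose in grams)
b_Intercept <- -2.58
b_doseg <- 2.41

# Plug-in probabilities of a DLT on an illustrative grid of doses;
# plogis() is the inverse logit. Using posterior means ignores
# posterior uncertainty, so this is only a rough reading.
doseg <- c(0.1, 0.4, 0.8)
theta <- plogis(b_Intercept + b_doseg * doseg)
print(round(data.frame(doseg, theta), 3))

# Odds are multiplied by exp(b_doseg * 0.1) for each additional 100 mg
print(round(exp(b_doseg * 0.1), 2))
```

The last line gives about 1.27, that is, each additional 100 mg increases the odds of a DLT by roughly a quarter under this plug-in reading.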
Now it is easier to interpret the presented values.
The separate model assumes that each study has a different dose effect. It would be a bit complicated to set a different prior on study specific intercepts and other coefficients, so we use the same prior for all.
fit_separate <- brm(events | trials(total) ~ 0 + study + doseg:study,
prior=prior(student_t(7, 0, 1.5), class='b'),
family=binomial(), data=dat.ursino2021)
Check the summary of the posterior and inference diagnostics.
fit_separate
Family: binomial
Links: mu = logit
Formula: events | trials(total) ~ 0 + study + doseg:study
Data: dat.ursino2021 (Number of observations: 49)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
studyAwada -1.69 0.70 -3.17 -0.42 1.00 4611
studyBorthakurMA -2.26 0.88 -4.08 -0.66 1.00 5408
studyBorthakurMB -1.39 0.82 -3.11 0.19 1.00 5440
studyChen -2.24 0.97 -4.40 -0.56 1.00 4925
studyClark -1.64 0.81 -3.43 -0.20 1.00 5155
studyCrumpMA -2.01 0.79 -3.71 -0.56 1.00 5767
studyCrumpMB -1.57 0.75 -3.17 -0.23 1.00 5679
studyFuruse -2.65 0.95 -4.65 -0.97 1.00 5020
studyMiller -1.04 0.48 -1.99 -0.09 1.00 4969
studyMinami -2.25 0.78 -3.91 -0.87 1.00 5334
studyMoore -1.72 0.72 -3.26 -0.39 1.00 5726
studyNabors -1.82 0.94 -3.91 -0.18 1.00 5174
studyStrumberg -1.71 0.63 -3.03 -0.54 1.00 5135
studyAwada:doseg 1.67 1.32 -0.67 4.51 1.00 4675
studyBorthakurMA:doseg -0.03 1.51 -3.09 2.97 1.00 5653
studyBorthakurMB:doseg 0.08 1.44 -2.76 3.00 1.00 5736
studyChen:doseg -0.74 1.67 -4.27 2.31 1.00 5348
studyClark:doseg 1.56 1.40 -0.93 4.71 1.00 5108
studyCrumpMA:doseg -0.36 1.72 -3.93 3.00 1.00 6756
studyCrumpMB:doseg 0.25 1.39 -2.55 2.97 1.00 5179
studyFuruse:doseg -0.50 1.75 -4.17 2.83 1.00 6532
studyMiller:doseg 0.06 1.41 -2.70 2.85 1.00 5221
studyMinami:doseg -0.25 1.48 -3.20 2.60 1.00 6042
studyMoore:doseg 0.54 1.37 -2.07 3.37 1.00 6050
studyNabors:doseg 1.36 1.30 -0.93 4.19 1.00 5156
studyStrumberg:doseg 0.37 1.11 -1.75 2.61 1.00 5297
Tail_ESS
studyAwada 2825
studyBorthakurMA 2978
studyBorthakurMB 2981
studyChen 2504
studyClark 2585
studyCrumpMA 2948
studyCrumpMB 2985
studyFuruse 2916
studyMiller 3327
studyMinami 2628
studyMoore 2989
studyNabors 2283
studyStrumberg 3140
studyAwada:doseg 2413
studyBorthakurMA:doseg 2867
studyBorthakurMB:doseg 2864
studyChen:doseg 2785
studyClark:doseg 2613
studyCrumpMA:doseg 2650
studyCrumpMB:doseg 2704
studyFuruse:doseg 2916
studyMiller:doseg 3306
studyMinami:doseg 2665
studyMoore:doseg 2911
studyNabors:doseg 2444
studyStrumberg:doseg 3194
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
We build two different hierarchical models. The first one has a hierarchical model for the intercept, that is, each study has a parameter telling how much that study differs from the common population intercept.
fit_hier1 <- brm(events | trials(total) ~ doseg + (1 | study),
prior=c(prior(student_t(7, 0, 1.5), class='Intercept'),
prior(normal(0, 1), class='b')),
family=binomial(), data=dat.ursino2021)
The second hierarchical model assumes that also the slope can vary between the studies.
fit_hier2 <- brm(events | trials(total) ~ doseg + (doseg | study),
prior=c(prior(student_t(7, 0, 1.5), class='Intercept'),
prior(normal(0, 1), class='b')),
family=binomial(), data=dat.ursino2021)
We see some divergences due to highly varying posterior curvature. We repeat the sampling with higher adapt_delta, which adjusts the step size to be smaller. Higher adapt_delta makes the computation slower, but that is not an issue in this case. If you get divergences with adapt_delta=0.99, it is likely that even larger values don’t help, and you need to consider a different parameterisation, a different model, or more informative priors.
fit_hier2 <- update(fit_hier2, control=list(adapt_delta=0.99))
LOO-CV comparison
loo_compare(loo(fit_pooled), loo(fit_separate), loo(fit_hier1), loo(fit_hier2))
Warning: Found 4 observations with a pareto_k > 0.7 in model 'fit_separate'. It
is recommended to set 'moment_match = TRUE' in order to perform moment matching
for problematic observations.
Warning: Found 2 observations with a pareto_k > 0.7 in model 'fit_hier2'. It is
recommended to set 'moment_match = TRUE' in order to perform moment matching
for problematic observations.
elpd_diff se_diff
fit_hier1 0.0 0.0
fit_pooled -0.9 1.7
fit_hier2 -1.2 0.6
fit_separate -15.1 2.8
We get warnings about several Pareto k’s > 0.7 in PSIS-LOO for separate model, but as in that case the LOO-CV estimate is usually overoptimistic and the separate model is the worst, there is no need to use more accurate computation for the separate model.
We get warnings about a few Pareto k’s > 0.7 in PSIS-LOO for both hierarchical models. We can improve the accuracy by running MCMC for these LOO folds. We use the add_criterion() function to store the LOO computation results as they take a bit longer now. We get some divergences in the case of the second hierarchical model, as leaving out an observation for a study that has only two dose levels gives the posterior a difficult shape.
fit_hier1 <- add_criterion(fit_hier1, criterion='loo', reloo=TRUE)
fit_hier2 <- add_criterion(fit_hier2, criterion='loo', reloo=TRUE)
We repeat the LOO-CV comparison (without the separate model). The loo() function uses the results added to the fit objects.
loo_compare(loo(fit_pooled), loo(fit_hier1), loo(fit_hier2))
elpd_diff se_diff
fit_hier1 0.0 0.0
fit_pooled -0.9 1.7
fit_hier2 -1.1 0.6
The results did not change much. The first hierarchical model is slightly better than the other models, but for predictive purposes there is not much difference (there is high aleatoric uncertainty in the predictions). Adding a hierarchical model for the slope decreased the predictive performance, and thus it is likely that there is not enough information about the variation in slopes between studies.
Posterior predictive checking showing the observed and predicted number of events. A rootogram uses the square root of counts on the y-axis for better scaling. A rootogram is useful for count data when the range of counts is small or moderate.
pp_check(fit_pooled, type = "rootogram") +
labs(title='Pooled model')
pp_check(fit_hier1, type = "rootogram") +
labs(title='Hierarchical model 1')
pp_check(fit_hier2, type = "rootogram") +
labs(title='Hierarchical model 2')
We see that the hierarchical models have higher probability for future counts that are bigger than the maximum observed count, and a longer predictive distribution tail. This is natural as uncertainty in the variation between studies increases predictive uncertainty, too, especially as the number of studies is relatively small.
The population level coefficient posterior given pooled model
plot_posterior_pooled <- mcmc_areas(as_draws_df(fit_pooled), regex_pars='b_doseg') +
geom_vline(xintercept=0, linetype='dashed') +
labs(title='Pooled model')
The population level coefficient posterior given hierarchical model 1
plot_posterior_hier1 <- mcmc_areas(as_draws_df(fit_hier1), regex_pars='b_doseg') +
geom_vline(xintercept=0, linetype='dashed') +
labs(title='Hierarchical model 1')
The population level coefficient posterior given hierarchical model 2
plot_posterior_hier2 <- mcmc_areas(as_draws_df(fit_hier2), regex_pars='b_doseg') +
geom_vline(xintercept=0, linetype='dashed') +
labs(title='Hierarchical model 2')
(plot_posterior_pooled / plot_posterior_hier1 / plot_posterior_hier2) * xlim(c(0,8.5))
Warning: Removed 1 rows containing missing values (`geom_segment()`).
All models agree that the slope is very likely positive. The hierarchical models have more uncertainty, but also higher posterior mean.
When we look at the study specific parameters, we see that the Miller study has slightly higher intercept (leading to higher theta).
(mcmc_areas(as_draws_df(fit_hier1), regex_pars='r_study\\[.*Intercept') +
labs(title='Hierarchical model 1')) /
(mcmc_areas(as_draws_df(fit_hier2), regex_pars='r_study\\[.*Intercept') +
labs(title='Hierarchical model 2'))
There are no clear differences in slopes.
mcmc_areas(as_draws_df(fit_hier2), regex_pars='r_study\\[.*doseg') +
labs(title='Hierarchical model 2')
Based on LOO comparison we could continue with any of the models, but if we want to take into account the unknown possible study variations, it is best to continue with a hierarchical model. We continue with hierarchical model 1. The posterior for the probability of an event given a certain dose and a new study for hierarchical model 1.
data.frame(study='new',
doseg=seq(0.1,1,by=0.1),
total=1) |>
add_linpred_draws(fit_hier1, transform=TRUE, allow_new_levels=TRUE) |>
ggplot(aes(x=doseg, y=.linpred)) +
stat_lineribbon(.width = c(.95), alpha = 1/2, color=brewer.pal(5, "Blues")[[5]]) +
scale_fill_brewer()+
labs(x= "Dose (g)", y = 'Probability of event', title='Hierarchical model') +
theme(legend.position="none") +
geom_hline(yintercept=0) +
scale_x_continuous(breaks=seq(0.1,1,by=0.1)) +
ylim(c(0,0.15))
Warning: Removed 26209 rows containing missing values (`stat_slabinterval()`).
If we plot individual posterior draws, we see that there is a lot of uncertainty about the overall probability (explained by the variation in Intercept in different studies), but less uncertainty about the slope.
data.frame(study='new',
doseg=seq(0.1,1,by=0.1),
total=1) |>
add_linpred_draws(fit_hier1, transform=TRUE, allow_new_levels=TRUE, ndraws=100) |>
ggplot(aes(x=doseg, y=.linpred)) +
geom_line(aes(group=.draw), alpha = 1/2, color = brewer.pal(5, "Blues")[[3]])+
scale_fill_brewer()+
labs(x= "Dose (g)", y = 'Probability of event') +
theme(legend.position="none") +
geom_hline(yintercept=0) +
scale_x_continuous(breaks=seq(0.1,1,by=0.1))
The dataset Studies on Pharmacologic Treatments for Chronic Obstructive Pulmonary Disease includes results from 39 trials examining pharmacologic treatments for chronic obstructive pulmonary disease (COPD).
Load data
load(url('https://github.com/wviechtb/metadat/raw/master/data/dat.baker2009.rda'))
# force character strings to factors for easier plotting
dat.baker2009 <- dat.baker2009 |>
mutate(study = factor(study),
treatment = factor(treatment),
id = factor(id))
Look at the first six lines of the data frame
head(dat.baker2009)
study year id treatment exac total
1 Llewellyn-Jones 1996 1996 1 Fluticasone 0 8
2 Llewellyn-Jones 1996 1996 1 Placebo 3 8
3 Boyd 1997 1997 2 Salmeterol 47 229
4 Boyd 1997 1997 2 Placebo 59 227
5 Paggiaro 1998 1998 3 Fluticasone 45 142
6 Paggiaro 1998 1998 3 Placebo 51 139
Total number of patients in each study varies a lot
dat.baker2009 |>
group_by(study) |>
summarise(N = sum(total)) |>
ggplot(aes(x=N, y=study)) +
geom_col(fill=4) +
labs(x='Number of patients per study', y='Study')
None of the treatments is included in every study, and each study includes 2 to 4 treatments.
crosstab <- with(dat.baker2009,table(study, treatment))
#
plot_treatments <- data.frame(number_of_studies=colSums(crosstab), treatment=colnames(crosstab)) |>
ggplot(aes(x=number_of_studies,y=treatment)) +
geom_col(fill=4) +
labs(x='Number of studies with a treatment X', y='Treatment') +
geom_vline(xintercept=nrow(crosstab), linetype='dashed') +
scale_x_continuous(breaks=c(0,10,20,30,39))
#
plot_studies <- data.frame(number_of_treatments=rowSums(crosstab), study=rownames(crosstab)) |>
ggplot(aes(x=number_of_treatments,y=study)) +
geom_col(fill=4) +
labs(x='Number of treatments in a study Y', y='Study') +
geom_vline(xintercept=ncol(crosstab), linetype='dashed') +
scale_x_continuous(breaks=c(0,2,4,6,8))
#
plot_treatments + plot_studies
The first model is pooling the information over studies, but estimating separate theta for each treatment (including placebo).
fit_pooled <- brm(exac | trials(total) ~ 0 + treatment,
prior = prior(student_t(7, 0, 1.5), class='b'),
family=binomial(), data=dat.baker2009)
Check the summary of the posterior and inference diagnostics.
fit_pooled
Family: binomial
Links: mu = logit
Formula: exac | trials(total) ~ 0 + treatment
Data: dat.baker2009 (Number of observations: 94)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat
treatmentBudesonide -0.31 0.09 -0.48 -0.12 1.00
treatmentBudesonidePFormoterol -0.49 0.10 -0.69 -0.30 1.00
treatmentFluticasone 0.35 0.04 0.28 0.43 1.00
treatmentFluticasonePSalmeterol 0.12 0.03 0.05 0.18 1.00
treatmentFormoterol -0.71 0.06 -0.83 -0.59 1.00
treatmentPlacebo -0.28 0.02 -0.32 -0.24 1.00
treatmentSalmeterol -0.38 0.03 -0.43 -0.33 1.00
treatmentTiotropium -0.90 0.03 -0.96 -0.84 1.00
Bulk_ESS Tail_ESS
treatmentBudesonide 5981 3028
treatmentBudesonidePFormoterol 6128 2602
treatmentFluticasone 6136 3117
treatmentFluticasonePSalmeterol 6435 3279
treatmentFormoterol 6886 2997
treatmentPlacebo 7009 2869
treatmentSalmeterol 6630 3101
treatmentTiotropium 6862 2780
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Treatment effect posteriors
fit_pooled |>
as_draws_df() |>
subset_draws(variable='b_', regex=TRUE) |>
set_variables(paste0('b_treatment[', levels(factor(dat.baker2009$treatment)), ']')) |>
as_draws_rvars() |>
spread_rvars(b_treatment[treatment]) |>
mutate(theta_treatment = rfun(plogis)(b_treatment)) |>
ggplot(aes(xdist=theta_treatment, y=treatment)) +
stat_halfeye() +
labs(x='theta', y='Treatment', title='Pooled over studies, separate over treatments')
Treatment effect odds-ratio posteriors
theta <- fit_pooled |>
as_draws_df() |>
subset_draws(variable='b_', regex=TRUE) |>
set_variables(paste0('b_treatment[', levels(factor(dat.baker2009$treatment)), ']')) |>
as_draws_rvars() |>
spread_rvars(b_treatment[treatment]) |>
mutate(theta_treatment = rfun(plogis)(b_treatment))
theta_placebo <- filter(theta,treatment=='Placebo')$theta_treatment[[1]]
theta |>
mutate(treatment_oddsratio = (theta_treatment/(1-theta_treatment))/(theta_placebo/(1-theta_placebo))) |>
filter(treatment != "Placebo") |>
ggplot(aes(xdist=treatment_oddsratio, y=treatment)) +
stat_halfeye() +
labs(x='Odds-ratio', y='Treatment', title='Pooled over studies, separate over treatments') +
geom_vline(xintercept=1, linetype='dashed')
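Because theta is the inverse-logit of the coefficient, the odds-ratio computed above equals the exponentiated difference of the logit-scale coefficients. The sketch below checks this identity at the posterior means printed in the fit_pooled summary (-0.28 for placebo, -0.90 for tiotropium). Plugging in posterior means is only an illustration: the plot code computes the odds-ratio draw by draw, which is the proper posterior summary. All variable names here are hypothetical.

```r
# Posterior means on the logit scale, copied from the fit_pooled summary
b_placebo_mean    <- -0.28
b_tiotropium_mean <- -0.90

# Corresponding values on the probability scale
p_placebo    <- plogis(b_placebo_mean)
p_tiotropium <- plogis(b_tiotropium_mean)

# Odds-ratio computed from probabilities, as in the plot code
or_from_theta <- (p_tiotropium / (1 - p_tiotropium)) /
  (p_placebo / (1 - p_placebo))

# Equivalent shortcut on the logit scale
or_from_logit <- exp(b_tiotropium_mean - b_placebo_mean)

print(c(or_from_theta = or_from_theta, or_from_logit = or_from_logit))
```

An odds-ratio below 1 means fewer events than with placebo, which is why the dashed line at 1 is the reference in the plot.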
We see large variation between treatments, and two treatments appear to be harmful, which is suspicious. Looking at the data, we see that not all studies included all treatments, so if some studies had more events overall, the above estimates can be wrong: they mix differences between studies with differences between treatments.
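To see how pooling over studies can mislead when treatments are spread unevenly across studies, consider a small made-up example (the numbers are hypothetical and not taken from dat.baker2009). Within each study the treatment has exactly the same event rate as placebo, yet the pooled proportions suggest that one treatment helps and the other harms.

```r
# Hypothetical toy data: two studies with different baseline event rates,
# each comparing placebo with a different treatment that has no effect
toy <- data.frame(
  study     = c("S1", "S1", "S2", "S2"),
  treatment = c("Placebo", "A", "Placebo", "B"),
  exac      = c(20, 20, 60, 60),
  total     = c(100, 100, 100, 100)
)

# Which treatments were included in which studies
print(table(toy$study, toy$treatment))

# Event proportions when pooling over studies
pooled <- aggregate(cbind(exac, total) ~ treatment, data = toy, FUN = sum)
pooled$prop <- pooled$exac / pooled$total
print(pooled)

# Event proportions within each study: treatment equals placebo
toy$prop <- toy$exac / toy$total
print(toy)
```

The same kind of cross-tabulation on the real data, for example `with(dat.baker2009, table(study, treatment))`, shows which treatments each study included.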
The target is a discrete count, but as the range of counts is large, a rootogram would look messy, and a density overlay plot is a better choice. Posterior predictive checking with kernel density estimates for the data and 10 posterior predictive replicates shows a clear discrepancy.
pp_check(fit_pooled, type='dens_overlay')
Posterior predictive checking with PIT values and an ECDF difference plot with envelope shows a clear discrepancy.
pp_check(fit_pooled, type='pit_ecdf', ndraws=4000)
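As a reminder of what the PIT check measures: the PIT value of an observation is the predictive probability of a value less than or equal to the observed one, and for a calibrated model these values are approximately uniform. The following self-contained sketch uses simulated continuous data (an assumption made for simplicity, the data here are counts) and hypothetical helper names to show how a too-narrow predictive distribution pushes PIT values towards 0 and 1.

```r
set.seed(48927)
n_obs   <- 200   # number of simulated observations
n_draws <- 1000  # number of simulated predictive draws per observation

y <- rnorm(n_obs)

# Predictive draws (rows = draws, columns = observations)
yrep_calibrated <- matrix(rnorm(n_draws * n_obs), nrow = n_draws)
yrep_narrow     <- matrix(rnorm(n_draws * n_obs, sd = 0.5), nrow = n_draws)

# Empirical PIT: proportion of predictive draws at or below each observation
pit_values <- function(y, yrep) colMeans(sweep(yrep, 2, y, FUN = "<="))

pit_calibrated <- pit_values(y, yrep_calibrated)
pit_narrow     <- pit_values(y, yrep_narrow)

# Share of PIT values in the extreme 5% tails; about 0.1 is expected
# under uniformity
extreme_share <- function(pit) mean(pit < 0.05 | pit > 0.95)
print(c(calibrated = extreme_share(pit_calibrated),
        narrow     = extreme_share(pit_narrow)))
```

An excess of extreme PIT values, as in the narrow case, is the kind of deviation that takes the ECDF difference outside its envelope.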
Posterior predictive checking with LOO-PIT values shows a clear discrepancy.
pp_check(fit_pooled, type='loo_pit_qq', ndraws=4000) +
geom_abline() +
ylim(c(0,1))
Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
Warning: Removed 9 rows containing missing values (`geom_point()`).
Warning: Removed 2 rows containing missing values (`geom_path()`).
The second model uses a hierarchical model for both treatment effects and study effects.
fit_hier <- brm(exac | trials(total) ~ (1 | treatment) + (1 | study),
family=binomial(), data=dat.baker2009)
Check the summary of the posterior and inference diagnostics.
fit_hier
Family: binomial
Links: mu = logit
Formula: exac | trials(total) ~ (1 | treatment) + (1 | study)
Data: dat.baker2009 (Number of observations: 94)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Group-Level Effects:
~study (Number of levels: 39)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 1.19 0.16 0.92 1.55 1.00 470 953
~treatment (Number of levels: 8)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.17 0.07 0.08 0.35 1.00 1150 1694
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -0.91 0.20 -1.31 -0.50 1.00 477 831
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
LOO-CV comparison
loo_compare(loo(fit_pooled), loo(fit_hier))
Warning: Found 24 observations with a pareto_k > 0.7 in model 'fit_pooled'. It
is recommended to set 'moment_match = TRUE' in order to perform moment matching
for problematic observations.
Warning: Found 22 observations with a pareto_k > 0.7 in model 'fit_hier'. It is
recommended to set 'moment_match = TRUE' in order to perform moment matching
for problematic observations.
elpd_diff se_diff
fit_hier 0.0 0.0
fit_pooled -1952.7 300.6
We get warnings about Pareto k values > 0.7 in PSIS-LOO, but as the difference between the models is huge, we can be confident that the order would be the same if we fixed the computation. The hierarchical model is much better, and there is high variation between studies. Clearly there are many highly influential observations.
Posterior predictive checking with kernel density estimates for the data and 10 posterior predictive replicates looks good (although with this many parameters, this check is likely to be optimistic).
pp_check(fit_hier, type='dens_overlay')
Posterior predictive checking with PIT values and ECDF difference plot with envelope looks good (although with this many parameters, this check is likely to be optimistic).
pp_check(fit_hier, type='pit_ecdf', ndraws=4000)
Posterior predictive checking with LOO-PIT values looks good (although as there are Pareto-khat warnings, it is possible that this diagnostic is optimistic).
pp_check(fit_hier, type='loo_pit_qq', ndraws=4000) +
geom_abline() +
ylim(c(0,1))
Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
Warning: Removed 2 rows containing missing values (`geom_path()`).
Treatment effect posteriors now have much less variation.
fit_hier |>
spread_rvars(b_Intercept, r_treatment[treatment,]) |>
mutate(theta_treatment = rfun(plogis)(b_Intercept + r_treatment)) |>
ggplot(aes(xdist=theta_treatment, y=treatment)) +
stat_halfeye() +
labs(x='theta', y='Treatment', title='Hierarchical over studies, hierarchical over treatments')
Study effect posteriors show the expected high variation.
fit_hier |>
spread_rvars(b_Intercept, r_study[study,]) |>
mutate(theta_study = rfun(plogis)(b_Intercept + r_study)) |>
ggplot(aes(xdist=theta_study, y=study)) +
stat_halfeye() +
labs(x='theta', y='Study', title='Hierarchical over studies, hierarchical over treatments')
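The group-level standard deviations in the fit_hier summary already indicate why the treatment posteriors are tight and the study posteriors are wide. As a rough sketch, using only the posterior means from that summary (Intercept -0.91, sd 0.17 for treatment, sd 1.19 for study) and ignoring posterior uncertainty, the range of theta covered by plus or minus two group-level standard deviations can be computed as below. The helper name theta_range is hypothetical.

```r
# Range of theta implied by intercept +/- 2 group-level standard deviations
theta_range <- function(intercept, sd_group) {
  out <- plogis(intercept + c(-2, 0, 2) * sd_group)
  setNames(out, c("lower", "centre", "upper"))
}

# Posterior means copied from the fit_hier summary
intercept    <- -0.91
sd_treatment <- 0.17
sd_study     <- 1.19

range_treatment <- theta_range(intercept, sd_treatment)
range_study     <- theta_range(intercept, sd_study)

print(round(rbind(treatment = range_treatment, study = range_study), 2))
```

The interval for studies is several times wider than the one for treatments, in line with the two plots above.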
Treatment effect odds-ratio posteriors
theta <- fit_hier |>
spread_rvars(b_Intercept, r_treatment[treatment,]) |>
mutate(theta_treatment = rfun(plogis)(b_Intercept + r_treatment))
theta_placebo <- filter(theta,treatment=='Placebo')$theta_treatment[[1]]
theta |>
mutate(treatment_oddsratio = (theta_treatment/(1-theta_treatment))/(theta_placebo/(1-theta_placebo))) |>
filter(treatment != "Placebo") |>
ggplot(aes(xdist=treatment_oddsratio, y=treatment)) +
stat_halfeye() +
labs(x='Odds-ratio', y='Treatment', title='Hierarchical over studies, hierarchical over treatments') +
geom_vline(xintercept=1, linetype='dashed')
Treatment effect odds-ratios now look more reasonable. As all treatments are now compared to placebo, there is less overlap in the distributions than when looking at the thetas, because all thetas include similar uncertainty about the overall theta due to the high variation between studies.
The third model includes an interaction so that the treatment effect can depend on study.
fit_hier2 <- brm(exac | trials(total) ~ (1 | treatment) + (treatment | study),
family=binomial(), data=dat.baker2009, control=list(adapt_delta=0.9))
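To get a feeling for how much the term `(treatment | study)` adds, the sketch below counts the group-level quantities it implies, assuming the default dummy coding of an 8-level factor and the 39 studies reported in the fit_hier summary. The factor levels are hypothetical stand-ins, and the counts are simple arithmetic rather than output from the fitted model.

```r
# Hypothetical stand-in for the treatment factor with 8 levels
toy <- data.frame(treatment = factor(paste0("T", 1:8)))
n_study <- 39  # number of studies, from the fit_hier summary

# Coefficients varying by study: intercept plus 7 dummy-coded contrasts
n_coef <- ncol(model.matrix(~ treatment, data = toy))

n_sd      <- n_coef              # one standard deviation per coefficient
n_cor     <- choose(n_coef, 2)   # correlations between the coefficients
n_effects <- n_study * n_coef    # study-specific effects

print(c(coefficients = n_coef, sds = n_sd,
        correlations = n_cor, study_effects = n_effects))
```

For comparison, `(1 | study)` in fit_hier has one standard deviation and 39 study-specific effects, and the data have 94 observations, so the third model is considerably more flexible.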
LOO-CV comparison of the two hierarchical models
loo_compare(loo(fit_hier), loo(fit_hier2))
Warning: Found 22 observations with a pareto_k > 0.7 in model 'fit_hier'. It is
recommended to set 'moment_match = TRUE' in order to perform moment matching
for problematic observations.
Warning: Found 45 observations with a pareto_k > 0.7 in model 'fit_hier2'. It
is recommended to set 'moment_match = TRUE' in order to perform moment matching
for problematic observations.
elpd_diff se_diff
fit_hier2 0.0 0.0
fit_hier -0.7 3.4
We get warnings about Pareto k values > 0.7 in PSIS-LOO, but as the models are similar and the difference is small, we can be relatively confident that the more complex model is not better.
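One simple way to put the two comparisons side by side is to express each elpd difference in units of its standard error. The sketch below uses the values printed in the two loo_compare outputs above; the data frame and column names are hypothetical. Both elpd_diff and se_diff can be unreliable when many Pareto k values are high, so this should be read as a rough guide only.

```r
# Values copied from the loo_compare outputs above
comparisons <- data.frame(
  comparison = c("fit_pooled vs fit_hier", "fit_hier vs fit_hier2"),
  elpd_diff  = c(-1952.7, -0.7),
  se_diff    = c(300.6, 3.4)
)

# Difference expressed in standard errors
comparisons$diff_in_se <- comparisons$elpd_diff / comparisons$se_diff

print(comparisons)
```

The first difference is more than six standard errors from zero, while the second is a small fraction of one standard error, which matches the conclusions drawn in the text.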